Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
SparkSession.builder
.master("local")
.appName("Section 2.3 - Creating New Columns")
.config("spark.some.config.option", "some-value")
.getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
|   | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
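Note that with only header=True, spark.read.csv() reads every column as a string, which is why age shows up as a string here rather than a number. A minimal sketch of passing an explicit schema instead, reusing the T module imported above (the type choices are assumptions based on the sample rows):

schema = T.StructType([
    T.StructField('id', T.IntegerType(), True),
    T.StructField('breed_id', T.IntegerType(), True),
    T.StructField('nickname', T.StringType(), True),
    T.StructField('birthday', T.TimestampType(), True),
    T.StructField('age', T.IntegerType(), True),
    T.StructField('color', T.StringType(), True),
])

# same read as above, but with typed columns instead of all strings
pets_typed = spark.read.csv(path, header=True, schema=schema)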
Creating New Columns and Transforming Data
When we are data wrangling and transforming data, we will usually assign the result to a new column. We will explore the withColumn() function and other transformation functions to achieve this. We will also look at how to rename a column with withColumnRenamed(); this is useful when you want to join two DataFrames on a column of the same name, among other things.
Case 1: New Columns - withColumn()
(
pets
.withColumn('nickname_copy', F.col('nickname'))
.withColumn('nickname_capitalized', F.upper(F.col('nickname')))
.toPandas()
)
|   | id | breed_id | nickname | birthday | age | color | nickname_copy | nickname_capitalized |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | King | KING |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | Argus | ARGUS |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | Chewie | CHEWIE |
What Happened?
We duplicated the nickname column as nickname_copy using the withColumn() function. We also created a new column where the nickname is converted to uppercase, by chaining multiple Spark functions together.
We will look into more advanced column creation in the next section, where we will go into more detail about what a column expression is and what the purpose of F.col() is.
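As a small preview, here is a sketch of a couple more withColumn() patterns built on the same pets data; F.lit() (a literal/constant column) and cast() are standard pyspark.sql tools, and the new column names are made up for illustration:

(
    pets
    # constant-valued column via a literal expression
    .withColumn('species', F.lit('dog'))
    # cast the string age column to an integer, then do arithmetic on it
    .withColumn('age_next_year', F.col('age').cast(T.IntegerType()) + 1)
    .toPandas()
)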
Case 2: Renaming Columns - withColumnRenamed()
(
pets
.withColumnRenamed('id', 'pet_id')
.toPandas()
)
|   | pet_id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown |
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None |
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None |
What Happened?
We renamed the id column to pet_id, replacing the original column.
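To tie this back to the join use case mentioned earlier: a minimal sketch, assuming a hypothetical breeds lookup DataFrame whose id column refers to a breed rather than a pet:

# hypothetical lookup table; its id means breed id, not pet id
breeds = spark.createDataFrame(
    [(1, 'Corgi'), (3, 'Border Collie')],
    ['id', 'breed_name'],
)

(
    pets
    # rename breeds.id to match pets.breed_id so both sides share the key name
    .join(breeds.withColumnRenamed('id', 'breed_id'), on='breed_id', how='left')
    .toPandas()
)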
Summary
- We learned how to create new columns from old ones by chaining Spark functions and using withColumn().
- We learned how to rename columns using withColumnRenamed().